KEP-5593: Configure the max CrashLoopBackOff delay #5594
Conversation
hankfreund commented Sep 30, 2025
- One-line PR description: Splitting KEP-4603: Tune CrashLoopBackoff into two KEPs.
- Issue link: Configure the max CrashLoopBackOff delay #5593
- Other comments: No material changes have been made to either KEP; content was removed from one or the other, and grammar was updated to make sense.
Welcome @hankfreund!
Hi @hankfreund. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/ok-to-test
Need to add a prod readiness file: keps/prod-readiness/sig-node/5593.yaml
Force-pushed from cacae2d to 511963e.
#### Beta

- Gather feedback from developers and surveys
Do we have any feedback? I'm not sure we want to block the beta on this.
Removed.
will rollout across nodes.
-->

<<[UNRESOLVED beta]>> Fill out when targeting beta to a release. <<[/UNRESOLVED]>>
If we're targeting beta this release, this needs to be filled out. Or were you planning to cover this in a follow-up PR?
The risk is that a configured crashloop backoff causes the kubelet to become unstable. If that happens, rollback just requires updating the config and restarting the kubelet.
I wasn't sure initially if I should do it all in one, but it makes sense. Updated this and all the following sections.
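For readers arriving from the split, a minimal sketch of the knob under discussion, assuming the field and gate names from KEP-4603's alpha implementation (`maxContainerRestartPeriod` in KubeletConfiguration behind the `KubeletCrashLoopBackOffMax` gate); treat both names as illustrative here. Rollback is reverting this field, or disabling the gate, and restarting the kubelet.

```go
// Sketch of the KubeletConfiguration addition this KEP graduates; the
// struct and field names assume KEP-4603's alpha and are illustrative.
package config

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

type CrashLoopBackOffConfig struct {
	// MaxContainerRestartPeriod caps the exponential CrashLoopBackOff
	// delay for every container on the node; the KEP allows lowering it
	// from the 300s default to a 1s minimum. Leaving it unset keeps the
	// default curve, which is also the rollback path: remove the field
	// from the kubelet config and restart the kubelet.
	MaxContainerRestartPeriod *metav1.Duration `json:"maxContainerRestartPeriod,omitempty"`
}
```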
@@ -0,0 +1,3 @@
kep-number: 5593
alpha:
  approver: TBD
This was already approved for alpha. You're just splitting the previous enhancement into 2 parts. I think you can put @soltysh here (copied from https://github.com/kubernetes/enhancements/blob/511963e97f955f97e9842ae3015b60af956539b3/keps/prod-readiness/sig-node/4603.yaml)
* `kubelet_pod_start_sli_duration_seconds`

###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? |
No, but I think you can just put N/A here. This feature is stateless.
I can agree with the stateless fact, but I need those feature on/off tests linked in the previous sections. Then update this section mentioning that b/c it's stateless it's sufficient to verify that turning the feature gate on and off works as expected.
Force-pushed from 511963e to d7ce106.
A few comments focused on clean separation between the two KEPs
(Success) and the pod is transitioned into a "Completed" state or the expected
length of the pod run is less than 10 minutes.

This KEP proposes the following changes:
nit:
- This KEP proposes the following changes:
+ This KEP proposes the following change:
Done.
Some observations and analysis were made to quantify these risks going into
alpha. In the [Kubelet Overhead Analysis](#kubelet-overhead-analysis), the code
paths all restarting pods go through result in 5 obvious `/pods` API server
I can't comment directly because it's out of range of the diff, but lines 499-508 still contain references to the per Node feature.
I went through and I think I got all of them.
included in the `config.validation_test` package.
### Rollout, Upgrade and Rollback Planning
On and after line 1149 (ref) in the Scalability section, the per Node feature is referenced; I suggest linking out to the other KEP there for inline context.
Done.
This KEP proposes the following changes:
* Provide a knob to cluster operators to configure maximum backoff down, to
  minimum 1s, at the node level
* Formally split image pull backoff and container restart backoff behavior
Should this bullet point still be in the other KEP as well since it was also done alongside that (or alternatively, should this bullet point be taken out of the top level content for both)? I think it was important to include references to these refactorings above the fold during the alpha phase so it was clear what was happening, but less important now that the alpha is implemented
I think removing it is all right.
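For context on the first bullet, a minimal sketch of the arithmetic the knob caps, assuming the kubelet's historical defaults (10s initial delay, doubling each restart, 300s ceiling); the function is illustrative, not the kubelet's actual code:

```go
package main

import (
	"fmt"
	"time"
)

// crashLoopDelay is illustrative only: the 10s initial delay doubling
// per restart reflects the kubelet's historical defaults, and maxBackoff
// is the per-node cap this KEP lets operators lower (to a 1s minimum).
func crashLoopDelay(restarts int, maxBackoff time.Duration) time.Duration {
	delay := 10 * time.Second
	for i := 0; i < restarts && delay < maxBackoff; i++ {
		delay *= 2
	}
	if delay > maxBackoff {
		delay = maxBackoff
	}
	return delay
}

func main() {
	// With the 300s default a 5th restart waits 300s; with the knob set
	// to 60s it waits 60s; at the 1s floor it restarts almost immediately.
	for _, max := range []time.Duration{300 * time.Second, 60 * time.Second, time.Second} {
		fmt.Printf("cap=%v delay=%v\n", max, crashLoopDelay(5, max))
	}
}
```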
rate limiting made up the gap to the stability of the system. Therefore, to
simplify both the implementation and the API surface, this 1.32 proposal puts
forth that the opt-in will be configured per node via kubelet configuration.
Now that this is the only feature referred to in this KEP, I feel like this section would read better with a subheading here like `### Implementing with KubeletConfiguration` or something. Before it was all smooshed together since there were already so many H3s lol, but that's not the case anymore.
Done.
All behavior changes are local to the kubelet component and its start up
configuration, so a mix of different (or unset) max backoff durations will not
cause issues.
Just noticed that this sentence is kinda vague.
- cause issues.
+ cause issues to running workloads.
Done.
* Formally split backoff counter reset threshold for container restart backoff
  behavior and maintain the current 10 minute recovery threshold
* Provide an alpha-gated change to get feedback and periodic scalability tests
  on changes to the global initial backoff to 1s and maximum backoff to 1 minute
Consider adding a sentence to both Overviews about how this was originally a bigger KEP that has been split out into two, and link to the other one there, so its quickly in context for new or returning readers.
Done.
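As a side note on the 10 minute recovery threshold in the hunk above, a minimal sketch of the reset rule it describes; the package and function names are illustrative, not the kubelet's actual code:

```go
package kuberuntime

import "time"

// Illustrative only: the kubelet's real reset logic lives in its
// container runtime manager. A container that stayed up at least the
// recovery threshold is treated as recovered, so its backoff counter
// (and therefore its next CrashLoopBackOff delay) starts over.
const recoveryThreshold = 10 * time.Minute

func shouldResetBackoff(lastRunDuration time.Duration) bool {
	return lastRunDuration >= recoveryThreshold
}
```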
* Formally split image pull backoff and container restart backoff behavior
* Formally split backoff counter reset threshold for container restart backoff
  behavior and maintain the current 10 minute recovery threshold
X-post from the other one: Consider adding a section to both Overviews about how this was originally a bigger KEP that has been split out into two, and link to the other one there, so its quickly in context for new or returning readers.
Done.
Force-pushed from 1d0be4d to 684ee23.
reviewers:
- "@tallclair"
approvers:
- TBD
@mrunalp can you take this?
sure
@hankfreund please add @mrunalp here
Done!
question.
-->

<<[UNRESOLVED beta]>> Fill out when targeting beta to a release. <<[/UNRESOLVED]>>
This needs to be filled out.
same as below
Got it. I think all the required sections are filled out now.
From the sig-node perspective, this is just a copy of what we got to alpha already. Pretty straightforward.
A couple of PRR questions are unanswered.
Force-pushed from 684ee23 to 6105155.
/assign @mrunalp
/lgtm (except approver needs to be listed)
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: hankfreund, mrunalp
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Tuning and benchmarking a new crashloopbackoff decay will take a lot of work. In the meantime, everyone can benefit from a per-node configurable max crashloopbackoff delay. Splitting the KEP into two KEPs to allow for graduating the latter to beta before the former.
This KEP is mostly a copy of keps/sig-node/4603-tune-crashloopbackoff with all the tuning bits removed (and grammar adjusted to make sense). The desire is to advance this KEP to beta sooner than we'd be able to advance the other one.
Force-pushed from 6105155 to e181610.
From a PRR pov mostly missing test links.
# The following PRR answers are required at alpha release
# List the feature gate name and the components for which it must be enabled
feature-gates:
- name: ReduceDefaultCrashLoopBackoffDecay
Nit: but above there's a see-also section you could update to point to the other KEP:
see-also:
  - "/keps/sig-node/5593-configure-the-max-crashloopbackoff-delay"
title: Configure the max CrashLoopBackOff delay
kep-number: 5593
authors:
- "@lauralorenz"
Nit: I'm assuming a lot is copied from the other KEP, but still I'd add hankfreund here.
- [ ] (R) Production readiness review approved
- [ ] "Implementation History" section is up-to-date for milestone
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
Nit: please make sure to update this checklist, ✔️ the appropriate ones.
[testgrid](https://testgrid.k8s.io/sig-testing-canaries#pull-kubernetes-integration-go-canary),
[latest
prow](https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-integration-go-canary/1710565150676750336)
* test with and without feature flags enabled
Can you update these links so they point to the exact tests verifying feature gate on and off? Are there any additional integration tests for this feature, if yes please provide here the necessary links.
We expect no non-infra related flakes in the last month as a GA graduation criteria.
-->

- Crashlooping container that restarts some number of times (ex 10 times),
In the graduation criteria below you're mentioning e2e for alpha, can you update this section with appropriate links to specific tests?
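For whoever wires up those links, a hedged sketch of the kind of fixture such an e2e test typically uses (not the actual test from this KEP): a pod whose container exits nonzero so the kubelet drives it through repeated CrashLoopBackOff restarts.

```go
package e2enode

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// crashLoopPod returns a pod whose container exits nonzero right away,
// so restartPolicy Always drives it through repeated CrashLoopBackOff
// restarts. A test would poll Status.ContainerStatuses[0].RestartCount
// until it reaches the target (e.g. 10) and assert on observed delays.
func crashLoopPod() *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "crashloop-backoff-test"},
		Spec: corev1.PodSpec{
			RestartPolicy: corev1.RestartPolicyAlways,
			Containers: []corev1.Container{{
				Name:    "crasher",
				Image:   "busybox",
				Command: []string{"sh", "-c", "exit 1"},
			}},
		},
	}
}
```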
No coordination needs to be done between the control plane and the nodes; all
behavior changes are local to the kubelet component and its start up
configuration. An n-3 kube-proxy, n-1kube-controller-manager, or n-1
- configuration. An n-3 kube-proxy, n-1kube-controller-manager, or n-1
+ configuration. An n-3 kube-proxy, n-1 kube-controller-manager, or n-1
and discussions with other contributors indicate that while little in core
kubernetes does strict parsing, it's not well tested. At minimum as part of this
implementation a test covering this for `KubeletConfiguration` objects will be
included in the `config.validation_test` package.
Since this seems copied from the other KEP, were those tests added, if so can you link them here?
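For reference while chasing those links, a hypothetical sketch of what such a table-driven validation case could look like; the helper name and the 1s floor are assumptions drawn from this KEP's stated minimum, not the actual test file:

```go
package config_test

import (
	"fmt"
	"testing"
	"time"
)

// validateMaxContainerRestartPeriod stands in for the real
// KubeletConfiguration validation; the 1s floor mirrors the KEP's
// stated minimum for the configurable max backoff.
func validateMaxContainerRestartPeriod(d time.Duration) error {
	if d < time.Second {
		return fmt.Errorf("maxContainerRestartPeriod %v is below the 1s minimum", d)
	}
	return nil
}

func TestCrashLoopBackOffValidation(t *testing.T) {
	for _, tc := range []struct {
		period  time.Duration
		wantErr bool
	}{
		{500 * time.Millisecond, true}, // below the 1s floor: rejected
		{time.Second, false},           // exactly the minimum: allowed
		{300 * time.Second, false},     // historical default cap: allowed
	} {
		err := validateMaxContainerRestartPeriod(tc.period)
		if (err != nil) != tc.wantErr {
			t.Errorf("period %v: got err=%v, wantErr=%v", tc.period, err, tc.wantErr)
		}
	}
}
```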
implementation difficulties, etc.).
-->

N/A
No
will be better answer.
-->

Maybe! As containers could be restarting more, this may affect "Startup latency
of schedulable stateless pods", "Startup latency of schedule stateful pods".
Have you performed any measurements as to how significant that degradation can be? Similar to how you provided rough estimations for increased CPU usage in the next question.